Abstract: In the era of the World Wide Web, the volume of digitized text documents has grown tremendously. With such a huge collection of documents on the web, there is a need to group documents into clusters. Document clustering plays an important role in navigating and organizing documents effectively. The k-means clustering algorithm is the most commonly used document clustering algorithm, and it takes less computation time than matrix-based clustering algorithms. Its major problem is that it is quite sensitive to the selection of the initial cluster centroids. This article proposes a hybrid Genetic K-means clustering algorithm that improves the quality of the clusters. Further, the hybrid algorithm is compared with the k-means algorithm on two different text document datasets. The experimental results show that the proposed method is more effective and converges to more accurate clusters than the k-means algorithm.
Keywords: Document Clustering, Cosine Similarity, k-means, Genetic Algorithm, Purity Measure.